Intro to R for Medical Data

A companion document for the R/Medicine 2021 Virtual Conference’s Introduction to R for Medical Data workshop.

true , true , true
2021-08-23

Welcome to R/Medicine 2021. In this workshop we will be doing a hands-on introduction to R for medical data.

Workshop basics

Why should I come to this workshop?

Set up for workshop

You’ll need the following during the workshop:

Please note while you can use a local version of R and Rstudio for the workshop, we cannot provide support for you during the workshop to address issues that come up your local device and software. We use RStudio Cloud because we can do all the technical preparation for participants ahead of time and spend our limited workshop time on learning R as fast as possible.

If you would like to some additional preparation for the workshop, I recommend learning some about Markdown in this ten minute tutorial.

Schedule

Introduction ~10 minutes

Systems check ~10 minutes

Learn and Do ~120 minutes

Wrap Up and Discussion ~10 minutes

Starting your journey in R

Orientation

RStudio Cloud

For this workshop we will be using RStudio Cloud.

Why? Because there is nothing that you need to download!

You don’t need to use RStudio to use R, but I do!

RStudio on your computer and RStudio Cloud look very similar.

R and RStudio are free to download to your computer.

Systems Check

Open up the project

Once you are in RStudio Cloud, click on the project called “Intro to R for Medical Data.”

RStudio Orientation: Basics

R Markdown Orientation

Let’s open the file in the file pane (number 4 in the image above) called Intro_to_R_for_medical_data_workshop.Rmd. This is what we will be working for the workshop.

In general, Rmd documents will have three types of components.

YAML

The YAML header contains the special instructions on how to create the output document. We won’t do much with it here today, but it is a very powerful way to make your Rmd file as bespoke as you want it!

Code Chunks

Code chunks are where the code will go.

Code chunks have a gray background.

#This is a code chunk!
#Here is a simple calculation
1 + 2
[1] 3

You can run a code chunk by pressing the green play button.

Part 1: Exploring Data

Data in a Excel spreadsheet?

No problem! R can ingest data from lots of locations including from Excel spreadsheets that you might already be very familiar with.

Load data into R

Run the code chunk below by pressing that green play button.

library(readxl)
smoke_complete <- read_excel("./data/smoke_complete.xlsx")

We have now loaded some data about patients with different types of cancer. You can view the source of this data here.

Inspecting a data frame

One thing that Excel does well is to provide an interactive visual representation of the data. This allows you to inspect it by sorting and filtering. RStudio actually does this well, too, with one difference - it won’t let you change any of the data while you inspect it.

Viewing your data

Look on the right at the Environment pane (you might have to click icon that looks like a spreadsheet on the “Environment” tab) and find the entry smoke_complete. This is the data frame you just created inside of R’s memory. (If you don’t see smoke_complete, try running the code chunk above again).

Within the Environment pane, click on the smoke_complete to view the data (not the blue arrow).

Go ahead and try to edit one of the values in this viewer. You will find that you can’t. It would have been easy for the RStudio programmers to allow editing of specific values, but they decided not to add that feature.

  1. Why do you think this was designed so you cannot edit specific values?

Getting a quick glimpse

Next we will use the glimpse() function from {dplyr} to learn about our data.

library(dplyr)
glimpse(smoke_complete)
Rows: 1,152
Columns: 20
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage i~
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 2337~
$ vital_status                <chr> "dead", "dead", "dead", "alive",~
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8~
$ days_to_death               <chr> "371", "136", "2304", "NA", "NA"~
$ state                       <chr> "live", "live", "live", "live", ~
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, ~
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_last_follow_up      <chr> "NA", "NA", "2099", "3747", "357~
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356~
$ years_smoked                <chr> "NA", "NA", "NA", "NA", "NA", "N~
$ gender                      <chr> "male", "male", "female", "male"~
$ year_of_birth               <chr> "1936", "1931", "1927", "1930", ~
$ race                        <chr> "white", "asian", "white", "whit~
$ ethnicity                   <chr> "not hispanic or latino", "not h~
$ year_of_death               <chr> "2004", "2003", "NA", "NA", "NA"~
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", ~
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", ~
  1. How many rows of data do we have in our data?
  2. How many columns are there? What are some of the column names?
  3. Can you tell what kind of variable is stored in the columns?

Getting quick summary statistics

Next, we will use the skim() function to learn even more about our data.

Press the green play button in the code chunk below.

  1. What are the different column types we have?
  2. What types are in this data?

Look at the mini histograms at the bottom of the output? What can you learn very quickly about your data for the distribution cigarettes_per_year variable

library(skimr)
skim(smoke_complete)
Table 1: Data summary
Name smoke_complete
Number of rows 1152
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
primary_diagnosis 0 1 5 6 0 17 0
tumor_stage 0 1 7 12 0 11 0
vital_status 0 1 4 5 0 2 0
morphology 0 1 6 6 0 12 0
days_to_death 0 1 1 4 0 279 0
state 0 1 4 4 0 1 0
tissue_or_organ_of_origin 0 1 5 5 0 16 0
site_of_resection_or_biopsy 0 1 5 5 0 16 0
days_to_last_follow_up 0 1 1 4 0 459 0
years_smoked 0 1 1 2 0 48 0
gender 0 1 4 6 0 2 0
year_of_birth 0 1 2 4 0 67 0
race 0 1 5 41 0 6 0
ethnicity 0 1 12 22 0 3 0
year_of_death 0 1 2 4 0 21 0
bcr_patient_barcode 0 1 12 12 0 734 0
disease 0 1 4 4 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at_diagnosis 0 1 24175.38 3926.13 7855.00 22068.75 24750.50 26927.00 32872 <U+2581><U+2581><U+2583><U+2587><U+2582>
days_to_birth 0 1 -24175.38 3926.13 -32872.00 -26927.00 -24750.50 -22068.75 -7855 <U+2582><U+2587><U+2583><U+2581><U+2581>
cigarettes_per_day 0 1 2.61 2.04 0.01 1.37 2.19 3.29 40 <U+2587><U+2581><U+2581><U+2581><U+2581>

Part 2: Plotting our data

Now we will take the data we have loaded and make some plots!

Produce a histogram of smoke_complete using geom_histogram(), mapping these variables to the following aesthetics:cigarettes_per_day to the x aesthetic. Save this plot to the variable plot_1 and show it.

library(ggplot2)

plot_1 <- ggplot(data = smoke_complete,
                 aes(x =  cigarettes_per_day)) +
  geom_histogram()

plot_1

We’ve made a plot, but there are some issues with it:

Let’s make our plot a bit prettier.

plot_2 <- ggplot(data = smoke_complete,
                 aes(x =  cigarettes_per_day)) +
  geom_histogram() +
  labs(title = "An excellent title for your plot",
       x = "Cigarettes Per Day",
       y = "Count") +
  theme_classic()

plot_2

You got a message after running that code that: “stat_bin() using bins = 30. Pick better value with binwidth.”

Let’s modify the width of our histogram’s bins in the code chunk below to 2. Then try 10.

plot_3 <- ggplot(data = smoke_complete,
                 aes(x =  cigarettes_per_day)) +
  geom_histogram(binwidth = 4) +
  labs(title = "An excellent title for your plot",
       x = "Cigarettes Per Day",
       y = "Count") +
  theme_classic()

plot_3

Finally let’s explore the differences in daily cigarettes for the gender variable by adding a fill color and facet_grid.

Below add gender to the facet.

plot_4 <- ggplot(data = smoke_complete,
                 aes(x =  cigarettes_per_day,
                     fill = gender)) +
  geom_histogram(binwidth = 2) +
  labs(title = "An excellent title for your plot",
       x = "Cigarettes Per Day",
       y = "Count",
       fill = "Gender") +
  theme_classic() +
  facet_grid(. ~ gender)

plot_4 

ggplot2: A Grammar of Graphics

ggplot2 is an extremely powerful software library for visualization.

The gg is short for Grammar of Graphics, which means that visualizations are expressed in a specific and consistent way for different types of visualizations in the package. There is a wonderful universe of complementary packages in R that extend ggplot to many different types of visualizations.

Here’s a visual summary of the different parts we’re talking about today. There are many parts to visualizations, but many of us don’t have the words to describe the different types of parts of a graph. Here we will spend some time breaking down the meeting of different constituent parts.

Note: there are different ways to write ggplot code to get the same output, sort of formal versus causal conversation. Here we will focus on the more formal style because it is the most explicit, but also more verbose! As you get familiar with ggplot, you will likely shorten your code a bit at times from this very explicit method. We won’t cover every layer in depth, but know that breaking down visualizations to these constituents allows for tremendous control.

Image from Thomas Lin Pedersen’s ggplot2 workshop

Learning to read ggplot2 code

A ggplot2 graphic consists of a mapping of variables in data to the aes()thetic attributes of geom_etric objects.

In code, this is translated as:

    # start the plot object with ggplot()
    # add assign smoke_complete to the data argument

    ggplot(data = smoke_complete,   

    # map the variables to visual properties of the graph
           mapping = aes(
            
    # map the x-axis to age_at_diagnosis
              x = age_at_diagnosis, 
              
    # map the y-axis to cigarettes_per_day       
              y = cigarettes_per_day,
    
    # map the color to the disease variable
              color = disease
              )) +

    # add the geometry and the alpha
          geom_point(alpha = 0.2) + # complete the geom to geom_point
  
    # add labels to your plot

          labs(title = "Age of diagnosis of cancer by daily cigarette consumption", # plot title
               x="Age in days", # x-axis label
               y="Cigarettes per day", # y-axis label
               color = "Disease type") + # label for color legend
  
    # add a facet to your plot
  
          facet_grid(cols = vars(gender)) + # facet by gender variable
  
    # add a theme to your plot
  
          theme_bw()
Full Example

Figure 1: Full Example

Things to note: we chain these things together with + (plus sign).

Rebuild the prior plot

Set the data argument to smoke_complete, then run the code chunk.

ggplot(data = smoke_complete)

Add mapping arguments to the aes function. Assign the variable age_at_diagnosis to x, cigarettes_per_day to y, and disease to color.

Run the code chunk. How does it look different from the prior code chunk?

ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
                                            y = cigarettes_per_day,
                                            color = disease))

Add geom_point().

ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
                                            y = cigarettes_per_day,
                                            color = disease)) +
      geom_point()

Ok, wow, looking great!! Any major differences you notice between the graph created by our step by step code chunk called Figure 1 and the graph you just made above?

In the next modification, modify the alpha argument to vary the opacity of the points. You can vary the value from 0 to 1. Try a few different values to see the difference.

ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
                                            y = cigarettes_per_day,
                                            color = disease)) +
      geom_point(alpha = 0.2)

Let’s add some lovely, informative labels to our graph.

plot_5 <- ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
                                            y = cigarettes_per_day,
                                            color = disease)) +
      geom_point(alpha = 0.2) +

      labs(title = " ", 
           x= " ",  
           y= " ",  
           color = " ") 

plot_5

Wait! What little trick was that? See the variable plot_5 in the code chunk above? That is storing your whole code with your nice new labels. Instead of continuing to rewrite your code, you can just modify as needed.

Below add gender to the cols argument of facet_grid() to separate out male and female to inspect the differences.

plot_5 +
  
      facet_grid(cols = vars(gender)) 

Finally we will play a bit with different themes. Pick two of the below themes and then add them to the plot_5 and run the code chunk to compare them.

Theme options:

plot_5 + theme_classic()
plot_5 + theme_dark()

There are plenty of out of the box themes in ggplot2, but you can make a theme, use other people’s, or even use a theme your organization might already have. Let’s load the `{ggthemes}’ library.

plot_5 + theme_economist() + labs(subtitle = "The Economist Style")
plot_5 + theme_wsj() + labs(subtitle = "WSJ Style")
plot_5 + theme_tufte() + labs(subtitle = "Tufte Style")

Time for a break!!

Part 3: Wrangling Data

Rarely is your data going to be in the form you need it to be analyzed and plotted. You will often need to wrangle your data and change the shape of it a bit.

Let’s discuss a bit different ways we might need to process data.

Discussion

There are different packages and ways people wrangle data with R, but we’re going to demonstrate using some packages from the {tidyverse}, which is a whole ecosystem of R packages organized around having tidy data.

select()

From our smoke_complete data, let’s select two columns to keep: gender and days_to_death.

library(dplyr)
select(smoke_complete, gender, days_to_death)   # select columns to keep
# A tibble: 1,152 x 2
   gender days_to_death
   <chr>  <chr>        
 1 male   371          
 2 male   136          
 3 female 2304         
 4 male   NA           
 5 female NA           
 6 male   345          
 7 male   716          
 8 male   2803         
 9 male   973          
10 male   1097         
# ... with 1,142 more rows

There is a lovely package called {magrittr} that was loaded earlier, but includes something called a pipe, that looks like this in code: %>% (pronounced “then”). It allows use to call a nicely pipe data and preform lots of tasks on it.

Run the code below and inspect the output

  1. How does it compare to the output from the prior code chunk?
smoke_complete %>% select(gender, days_to_death) 
# A tibble: 1,152 x 2
   gender days_to_death
   <chr>  <chr>        
 1 male   371          
 2 male   136          
 3 female 2304         
 4 male   NA           
 5 female NA           
 6 male   345          
 7 male   716          
 8 male   2803         
 9 male   973          
10 male   1097         
# ... with 1,142 more rows

filter()

Now let’s meet filter().

filter(smoke_complete, bcr_patient_barcode == "TCGA-18-3412")  # filter to specific rows
# A tibble: 2 x 20
  primary_diagnosis tumor_stage age_at_diagnosis vital_status
  <chr>             <chr>                  <dbl> <chr>       
1 C34.3             stage ib               19025 dead        
2 C34.3             stage ib               19025 dead        
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
#   state <chr>, tissue_or_organ_of_origin <chr>,
#   days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
#   days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
#   years_smoked <chr>, gender <chr>, year_of_birth <chr>,
#   race <chr>, ethnicity <chr>, year_of_death <chr>,
#   bcr_patient_barcode <chr>, disease <chr>

The pipe alternative way.

smoke_complete %>% 
  filter(bcr_patient_barcode == "TCGA-18-3412")
# A tibble: 2 x 20
  primary_diagnosis tumor_stage age_at_diagnosis vital_status
  <chr>             <chr>                  <dbl> <chr>       
1 C34.3             stage ib               19025 dead        
2 C34.3             stage ib               19025 dead        
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
#   state <chr>, tissue_or_organ_of_origin <chr>,
#   days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
#   days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
#   years_smoked <chr>, gender <chr>, year_of_birth <chr>,
#   race <chr>, ethnicity <chr>, year_of_death <chr>,
#   bcr_patient_barcode <chr>, disease <chr>

Together select() and filter() will be big workhorses in your data wrangling toolkit. However, at first it is easy to confuse which function does what.

select() selects the columns to stay.

filter() filters the rows to keep that meet certain conditions placed on columns. So in the example above, we filtered the data to only keep the rows where the bcr_patient_barcode was equal to “TCGA-18-3412.”

%>%

We meet the pipe %>% earlier, but let’s see how we can combine multiple pipes in a row to get a more complicated operation done.

smoke_complete %>% 
  filter(year_of_birth <= 1960) %>% 
  select(age_at_diagnosis, gender)
# A tibble: 1,032 x 2
   age_at_diagnosis gender
              <dbl> <chr> 
 1            24477 male  
 2            26615 male  
 3            28171 female
 4            27154 male  
 5            23370 female
 6            19025 male  
 7            26938 male  
 8            28430 male  
 9            30435 male  
10            24019 male  
# ... with 1,022 more rows

mutate()

Let’s make age_at_diagnosis more human friendly by creating a new column that divides the days by 365.25. (There are great R packages to handle data, times, duration, intervals, and all those other message time and date issues!)

mutate also lets you create multiple “mutations” (e.g., calculations) at the same time.

table_1 <- smoke_complete %>%
  mutate(age_at_diagnosis_years = age_at_diagnosis/365.25,
         age_at_diagnosis_years_round = round(age_at_diagnosis_years, 1)) %>% 
  select(age_at_diagnosis, age_at_diagnosis_years, age_at_diagnosis_years_round)

table_1
# A tibble: 1,152 x 3
   age_at_diagnosis age_at_diagnosis_years age_at_diagnosis_years_rou~
              <dbl>                  <dbl>                       <dbl>
 1            24477                   67.0                        67  
 2            26615                   72.9                        72.9
 3            28171                   77.1                        77.1
 4            27154                   74.3                        74.3
 5            23370                   64.0                        64  
 6            19025                   52.1                        52.1
 7            26938                   73.8                        73.8
 8            28430                   77.8                        77.8
 9            30435                   83.3                        83.3
10            24019                   65.8                        65.8
# ... with 1,142 more rows

Getting counts

Let’s see how many unique values there are in each column of our dataset. The skim function from earlier will report the number of unique values in a column in the n_unique column.

skim(smoke_complete)
(#tab:skim_smoke)Data summary
Name smoke_complete
Number of rows 1152
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
primary_diagnosis 0 1 5 6 0 17 0
tumor_stage 0 1 7 12 0 11 0
vital_status 0 1 4 5 0 2 0
morphology 0 1 6 6 0 12 0
days_to_death 0 1 1 4 0 279 0
state 0 1 4 4 0 1 0
tissue_or_organ_of_origin 0 1 5 5 0 16 0
site_of_resection_or_biopsy 0 1 5 5 0 16 0
days_to_last_follow_up 0 1 1 4 0 459 0
years_smoked 0 1 1 2 0 48 0
gender 0 1 4 6 0 2 0
year_of_birth 0 1 2 4 0 67 0
race 0 1 5 41 0 6 0
ethnicity 0 1 12 22 0 3 0
year_of_death 0 1 2 4 0 21 0
bcr_patient_barcode 0 1 12 12 0 734 0
disease 0 1 4 4 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at_diagnosis 0 1 24175.38 3926.13 7855.00 22068.75 24750.50 26927.00 32872 <U+2581><U+2581><U+2583><U+2587><U+2582>
days_to_birth 0 1 -24175.38 3926.13 -32872.00 -26927.00 -24750.50 -22068.75 -7855 <U+2582><U+2587><U+2583><U+2581><U+2581>
cigarettes_per_day 0 1 2.61 2.04 0.01 1.37 2.19 3.29 40 <U+2587><U+2581><U+2581><U+2581><U+2581>

Usually we want to get counts from a categorical variable, so when there are 100s of unique values, that is typically a column that we wouldn’t be counting.

Here we can use the count function to count the number of observations for each unique value of the vital_status column.

smoke_complete %>%
  count(vital_status)
# A tibble: 2 x 2
  vital_status     n
  <chr>        <int>
1 alive          647
2 dead           505

If we pass in multiple columns, it will give us a count for each combination of column values

smoke_complete %>%
  count(vital_status, disease)
# A tibble: 6 x 3
  vital_status disease     n
  <chr>        <chr>   <int>
1 alive        BLCA      121
2 alive        CESC       66
3 alive        LUSC      460
4 dead         BLCA      103
5 dead         CESC       26
6 dead         LUSC      376

If you want more of a “2x2” table of results, you can use the base-R table function. We can pass in each column of values from our data using the $ notation

table(smoke_complete$vital_status, smoke_complete$disease)
       
        BLCA CESC LUSC
  alive  121   66  460
  dead   103   26  376

We can also use the addmargins if we want to see row and column sums with our count table.

table(smoke_complete$vital_status, smoke_complete$disease) %>%
  addmargins()
       
        BLCA CESC LUSC  Sum
  alive  121   66  460  647
  dead   103   26  376  505
  Sum    224   92  836 1152

A moment for beautiful tables

There are several great packages that can make lovely tables that are publication ready.

One quick example here of our prior code chunk output table_1. We take that same object and let a package called kableExtra work some magic on it!

table_1 %>% 
  head(10) %>% #just to only print the top 10 rows
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
age_at_diagnosis age_at_diagnosis_years age_at_diagnosis_years_round
24477 67.01437 67.0
26615 72.86790 72.9
28171 77.12799 77.1
27154 74.34360 74.3
23370 63.98357 64.0
19025 52.08761 52.1
26938 73.75222 73.8
28430 77.83710 77.8
30435 83.32649 83.3
24019 65.76044 65.8

Quick preview rmarkdown

R Markdown documents, like this document, allow you to place text and analysis with the code all in a single document and output the result into different formats such as an html webpage, a pdf, or even a word document.

Let’s quickly knit a document to seee the output. In pane 4, or the file pane, find the document called knit_preview.Rmd and open it. We will see examples of knitting the same .Rmd file to a word document, a pdf, and to html.

Wrap up

How to search for help

Within R you can type ? followed by the name of the function eg ?filter() in the console.

In RStudio, you can look at the help tab in pane 4.

However, in the end your best resource is probably going to be searching online for the issue. From online forms, to blog posts, to twitter threads there is a ton of content out there, but crafting a good search query is a workshop in itself!

Extra: Standardizing variable names: clean_names()

The last package we’ll talk about today is the {janitor} package. This package adds some really useful functions for cleaning data.

The one I use the most is called clean_names(). This will mutate column names into a standard format - all character values will be lowercase and spaces will be converted to underscores.

The smoke_complete dataset has already had this function applied to it, but let’s create a smaller example dataset where this wouldn’t be the case.

library(readr)
smoke_bad_colnames <- read_csv("data/smoke_bad_colnames.csv")
smoke_bad_colnames
# A tibble: 1,152 x 4
   `primary diagnosis` `tumor Stage` `Age at Diagnosis` vitalSTATUS
   <chr>               <chr>                      <dbl> <chr>      
 1 C34.1               stage ia                   24477 dead       
 2 C34.1               stage ib                   26615 dead       
 3 C34.3               stage ib                   28171 dead       
 4 C34.1               stage ia                   27154 alive      
 5 C34.1               stage iiia                 23370 alive      
 6 C34.3               stage ib                   19025 dead       
 7 C34.3               stage iv                   26938 dead       
 8 C34.1               stage ib                   28430 dead       
 9 C34.1               stage iib                  30435 dead       
10 C34.9               stage iv                   24019 dead       
# ... with 1,142 more rows

Ack! These column names have spaces, inconsistent capitalization, which make these difficult to work with.

So let’s use clean_names() to fix this:

library(janitor)
smoke_good_colnames <- smoke_bad_colnames %>%
    clean_names()

smoke_good_colnames
# A tibble: 1,152 x 4
   primary_diagnosis tumor_stage age_at_diagnosis vital_status
   <chr>             <chr>                  <dbl> <chr>       
 1 C34.1             stage ia               24477 dead        
 2 C34.1             stage ib               26615 dead        
 3 C34.3             stage ib               28171 dead        
 4 C34.1             stage ia               27154 alive       
 5 C34.1             stage iiia             23370 alive       
 6 C34.3             stage ib               19025 dead        
 7 C34.3             stage iv               26938 dead        
 8 C34.1             stage ib               28430 dead        
 9 C34.1             stage iib              30435 dead        
10 C34.9             stage iv               24019 dead        
# ... with 1,142 more rows

By default, clean_names() converts to snake case (spaces between words are replaced by an underscore _). clean_names() will help you standardize your dataset’s column names. Knowing that everything follows a certain convention will make your data cleaning journey better.

Further Reading and Resources

ggplot2: Elegant Graphics for Data Analysis

R for Data Science

R Graphics Cookbook

The list is always growing! The trouble isn’t finding good resources, it’s finding time to read them!

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Arnold, Jeffrey B. 2021. Ggthemes: Extra Themes, Scales and Geoms for Ggplot2. https://github.com/jrnold/ggthemes.
Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://github.com/sfirke/janitor.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2021. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2020. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Jim Hester. 2020. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

References